CharRange and CharIter #111

CAD97 · 2017-08-12T01:30:09Z

Expanded documentation and tests are TODO -- working on those presently.

We can land the patch to use the char range in a different PR.

Expanded documentation and tests are TODO but this should work

CAD97 · 2017-08-12T01:30:26Z

unic/char/range/Cargo.toml

+license = "MIT/Apache-2.0"
+keywords = ["text", "unicode", "iteration"]
+description = "UNIC - CharRange"
+categories = ["text-processing", "iteration"]


Haven't actually checked that this is valid yet.

There's no "iteration" category. Maybe we don't need more than one. Here's the list: https://crates.io/categories/

FYI, I've got "internationalization" (and "localization") added to the list, but it hasn't made it to the live version yet. When that happens, almost all UNIC components get both text-processing and internationalization.

For some reason `allow(private_in_public)` is in the test code in 1.17. This is an error when `forbid(future_incompatible)`. So I made it deny.

CAD97 · 2017-08-12T03:36:15Z

Benchmark results from char-iter:

test benches::count          ... bench:   1,000,194 ns/iter (+/- 34,045)
test benches::count_baseline ... bench:     937,333 ns/iter (+/- 66,945)

Benchmark results from unic-char-range:

test count          ... bench:   2,855,694 ns/iter (+/- 174,285)
test count_baseline ... bench:     941,054 ns/iter (+/- 59,795)

... I've got a bit of work to do don't I ...

CAD97 · 2017-08-12T06:16:52Z

I can not for the life of me figure out why char-iter is better than twice as fast as I am.

I'm going to try re-reimplementing this but closer to char-iter and see if I can't get the better performance.

CAD97 · 2017-08-12T07:37:44Z

I figured out where I was losing time. In my experiments I think I also found a design that is cleaner for our uses, though... and it involves fewer moving parts (I think).

behnam

Great work on getting the optimization right! 💯

Let me sleep on the type name and macro syntax. I don't want us to diverge from rustc concepts too much, so that it's easier for new users to adapt, and easier to put things back into rustc. For the macro especially, I think we had better support the common syntax, at least.

The code looks very good. A bunch of nits and a question inline.

behnam · 2017-08-12T08:02:47Z

unic/char/range/Cargo.toml

+repository = "https://github.com/behnam/rust-unic/"
+license = "MIT/Apache-2.0"
+keywords = ["text", "unicode", "iteration"]
+description = "UNIC - CharRange"


Would be better to use more informative language in description, as it's the only text that shows up on crates.io lists and item page, and allows better matching in search.

To follow the pattern from other UNIC components, it can be something like this:

description = "UNIC - Unicode Characters - Character Range and Iteration"

behnam · 2017-08-12T08:11:30Z

unic/char/range/src/iter.rs

+            return None;
+        }
+
+        let char = unsafe { char::from_u32_unchecked(self.low) };


I didn't know it's possible to call a variable char! That's cool! It's an uncommon practice, though I don't know why. Anyway, it may be safer to just call it ch or chr.

👍 it's a bad habit. It's not something you'd really expect to work, but it does (I think because Rust does type and variable name lookup differently). Surprisingly though, my IDE highlights it right, which is part of why I do it even though I probably shouldn't.
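For illustration, a minimal sketch of why a binding named char compiles: types and values live in separate namespaces, so a value called char does not hide the primitive type. The helper name `shout` is hypothetical and not part of this PR.

```rust
/// Hypothetical helper (not from the PR): upper-cases an ASCII letter.
fn shout(char: char) -> char {
    // The parameter `char` is a value binding. The path `char::from_u32`
    // is resolved through the type namespace, so both names coexist.
    let code = char.to_ascii_uppercase() as u32;
    // Fall back to the input if the code point were somehow invalid.
    char::from_u32(code).unwrap_or(char)
}
```

It works, but as noted in the thread, a name like ch or chr is easier on readers.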

behnam · 2017-08-12T08:15:33Z

unic/char/range/src/iter.rs

+        }
+        let naive_len = self.high as usize - self.low as usize;
+        if self.low <= SURROGATE_RANGE.start && SURROGATE_RANGE.end <= self.high {
+            naive_len - SURROGATE_RANGE.len()


Maybe you can optimize here by putting this number in a const variable, but could be unnecessary. Not sure.

Unfortunately doing so and still calling len() would require const fn to be stable. I can see if precomputing it manually saves any non-negligible amount of time though!

Turns out it takes 2 ns (+/- 0 ns) either way.
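A rough sketch of the length computation discussed here, with the surrogate gap size written as a plain constant expression instead of a call to len(). It assumes an inclusive low..=high range of char values; the PR's iterator stores its bounds differently, and the names `scalar_count` and `SURROGATE_LEN` are hypothetical.

```rust
/// First surrogate code point (U+D800); surrogates are not valid `char`s.
const SURROGATE_START: u32 = 0xD800;
/// One past the last surrogate code point (U+DFFF).
const SURROGATE_END: u32 = 0xE000;
/// Size of the surrogate gap, precomputed as a constant expression.
const SURROGATE_LEN: usize = (SURROGATE_END - SURROGATE_START) as usize;

/// Number of `char`s in the inclusive range `low..=high` (hypothetical helper).
fn scalar_count(low: char, high: char) -> usize {
    if low > high {
        return 0; // a "backwards" range is empty
    }
    let (low, high) = (low as u32, high as u32);
    let naive_len = (high - low) as usize + 1;
    // A `char` is never a surrogate, so the range either spans the whole
    // gap or does not touch it at all.
    if low < SURROGATE_START && high >= SURROGATE_END {
        naive_len - SURROGATE_LEN
    } else {
        naive_len
    }
}
```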

behnam · 2017-08-12T08:16:19Z

unic/char/range/src/iter.rs

+
+        // ensure `high` is never one greater than a surrogate code point
+        if self.high == SURROGATE_RANGE.end {
+            self.high = SURROGATE_RANGE.start;


This implementation with the new invariants is even more clean and slim! 👍

You'll like #112 then 😄

behnam · 2017-08-12T08:17:24Z

unic/char/range/src/lib.rs

+//! ```
+//!
+#![forbid(bad_style, missing_debug_implementations, unconditional_recursion)]
+#![deny(missing_docs, unsafe_code, unused, future_incompatible)]


Why are unused and future_incompatible in deny() but not forbid()? It doesn't look like we allow() them anywhere in this crate.

d24057a for future_incompatible. (It doesn't work on 1.17 for some reason.) unused, just because I was allowing it temporarily while building the library up.

Basically, I picked deny for things that could plausibly be temporarily allowed.

I see. I guess it's fine for non-major flags, and we should probably just rely on clippy (in a future version) to hint on all forbidable denys.
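To make the deny/forbid distinction concrete, here is a small sketch using a module-level attribute rather than the crate-level ones in lib.rs. The module and function names are hypothetical, and unused_variables stands in for the whole unused lint group.

```rust
// `deny` turns the lint into an error but still lets inner items opt out.
#[deny(unused_variables)]
mod lints {
    // Accepted under `deny`. Under `#[forbid(unused_variables)]` on the
    // module, this `allow` attribute would itself be a compile error.
    #[allow(unused_variables)]
    pub fn ignores_argument(x: u32) -> u32 {
        42
    }
}
```

This is why deny suits lints that may need a temporary allow while a crate is being built up.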

behnam · 2017-08-12T08:25:16Z

unic/char/range/src/range.rs

+
+    /// Does this range include a character?
+    #[inline]
+    pub fn contains(&self, char: char) -> bool {


Same about the char variable name.

behnam · 2017-08-12T08:27:41Z

unic/char/range/src/step.rs

@@ -0,0 +1,40 @@
+use std::char;


Since we have a module here called char, I think it would be easier to read the code if we stick with std::char instead of using it as char. What do you think?

If the problem is std::char and the primitive type char conflicting, that's an intended overlap. The char module contains things like char::from_u32 and char::MAX which should act like they are on the primitive char.

If you're saying unic-char, that's not referenced anywhere from my code. The only time there is a char module there is from inside the main unic crate.

Oh, I thought we have a char.rs here next to step.rs. You're right, there's no conflict and it's fine. #aftermidnightreview
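As a small illustration of the intended overlap, importing the std::char module leaves the primitive type char usable in the same scope. The helper name `to_scalar` is hypothetical.

```rust
use std::char; // the module; the primitive type `char` stays usable

/// Hypothetical helper: converts a code point to a `char`, returning `None`
/// for surrogates and for values above U+10FFFF.
fn to_scalar(code: u32) -> Option<char> {
    // `char::from_u32` goes through the imported module, while the
    // return type above refers to the primitive.
    char::from_u32(code)
}
```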

behnam · 2017-08-12T08:28:28Z

unic/char/range/src/step.rs

+///
+/// If the given `char` is already `char::MAX`, it is returned unchanged.
+#[allow(unsafe_code)]
+pub fn forward(c: char) -> char {


Using the same variable name here would make it easier to read.
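For reference, a sketch of what a saturating forward step can look like without unsafe. This is an assumption about the shape of the function based on the doc comment in the hunk, not the PR's actual body, which uses unsafe code.

```rust
/// Sketch of a safe `forward` step (hypothetical; the PR's version may differ).
fn forward(c: char) -> char {
    if c == char::MAX {
        return c; // saturate at U+10FFFF
    }
    if c == '\u{D7FF}' {
        return '\u{E000}'; // hop over the surrogate gap
    }
    // Every other successor is a valid scalar value, so this cannot fail.
    char::from_u32(c as u32 + 1).expect("successor is a valid scalar value")
}
```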

behnam · 2017-08-12T08:32:26Z

unic/char/range/src/step.rs

+/// If the given `char` is already `char::MAX`, it is returned unchanged.
+#[allow(unsafe_code)]
+pub fn forward(c: char) -> char {
+    if c == char::MAX {


So, these checks cause open_range() to silently create a CharRange with first > last. For example, if we go with open_range('\u{20}', '\u{21}'), then first == U+21 and last == U+20. Does this violate any invariants, or do we not care?

(It doesn't look like it causes any issues right now, but it may with expansions of the API.)

It's the same with .. syntax: you can create a "backwards" range and it just silently gives you a finished Range. We can put panic! guards around the constructors for improper use, but the standard library here has set a precedent of making an empty range.

Cool! Yeah, as long as the behavior is intended and works similar to std lib, it's fine.

I think it's worth adding a hint about it in the constructors, though.
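The standard library precedent mentioned above can be checked directly: a range whose start is past its end simply yields nothing. This sketch assumes a recent toolchain where std ranges over char are iterable, which was not the case on the 1.17 toolchain this PR targets. The helper name `count_between` is hypothetical.

```rust
/// Hypothetical helper: counts the chars in `first..=last` using std ranges.
fn count_between(first: char, last: char) -> usize {
    // A "backwards" range is not an error; it is just empty.
    (first..=last).count()
}
```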

behnam · 2017-08-12T08:34:59Z

unic/char/range/tests/iter_tests.rs

+#[test]
+fn iter_fused() {
+    let mut iter = CharRange::all().iter();
+    let mut fused = all_chars().into_iter().fuse();


I think this needs to be gated behind the fused feature. Surprised that it's passing the CI!

The fn to fuse an iterator is stable. It's merely the trait to declare yourself as fused that is unstable.
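A short sketch of what the stable fuse() adapter does, independent of the FusedIterator marker trait. The `Flaky` iterator and the `first_three` helper are hypothetical and exist only to make the difference visible.

```rust
/// Hypothetical iterator that returns `None` on even steps and `Some` on
/// odd ones, so it is deliberately not fused.
struct Flaky {
    n: u32,
}

impl Iterator for Flaky {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        self.n += 1;
        if self.n % 2 == 0 {
            None
        } else {
            Some(self.n)
        }
    }
}

/// Calls `next` three times and records each result in order.
fn first_three<I: Iterator<Item = u32>>(mut it: I) -> [Option<u32>; 3] {
    [it.next(), it.next(), it.next()]
}
```

Wrapping the iterator in fuse() guarantees that once None is returned, every later call returns None as well, which is the property the test in the hunk relies on.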

behnam · 2017-08-12T08:39:41Z

Btw, also take a look at https://doc.rust-lang.org/std/collections/enum.Bound.html and its usage in https://doc.rust-lang.org/std/collections/range/trait.RangeArgument.html .

CAD97 · 2017-08-12T08:44:47Z

Whoops, replaced it while you were reviewing... Much was shared though so it's not lost effort.

CAD97 · 2017-08-12T09:05:27Z

As for the macro, the syntax was a curveball idea. I'm sticking with just the standard .. and ..= offerings.
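A minimal sketch of how a macro can accept both range spellings. The `Bounds` enum is a hypothetical stand-in for the real range type, and the macro body here is an illustration, not the one from the PR.

```rust
/// Hypothetical stand-in for a character range type.
#[derive(Debug, PartialEq)]
enum Bounds {
    /// Both endpoints included, as in `'a'..='z'`.
    Closed(char, char),
    /// Upper endpoint excluded, as in `'a'..'z'`.
    HalfOpen(char, char),
}

// Each endpoint is matched as a single token tree so that the `..` or
// `..=` token between them can be matched literally.
macro_rules! chars {
    ($low:tt ..= $high:tt) => {
        Bounds::Closed($low, $high)
    };
    ($low:tt .. $high:tt) => {
        Bounds::HalfOpen($low, $high)
    };
}
```

Matching the endpoints as token trees means they must be single tokens such as char literals or identifiers, which is enough for the common case.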

behnam

Thanks, @CAD97! A couple of responses here, but I'll put everything possibly actionable on #112, as well.


112: [char/range] Add CharRange and CharIter r=behnam

The first half of addressing #91. Closes #111; this manner of attack is better than that one. This PR has only one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. Construction is handled by offering both half-open and closed constructors, plus a macro to allow `'a'..='z'` syntax.

CharRange and CharIter

018fef0

Expanded documentation and tests are TODO but this should work

CAD97 self-assigned this Aug 12, 2017

CAD97 requested a review from behnam August 12, 2017 01:30


CAD97 added 4 commits August 11, 2017 22:05

chars! macro

76b29e7

Iterator tests

b823615

Benchmarks

85579ae

Fix 1.17

d24057a

For some reason `allow(private_in_public)` is in the test code in 1.17. This is an error when `forbid(future_incompatible)`. So I made it deny.

CAD97 added 6 commits August 11, 2017 23:37

Tweak bench code

4b63146

CharRange API tweaks

afa38a0

Fix step backward

a8fb34d

Cleanup

868a38b

Expose all four range types in the macro

c3726ef

Remove unused import

b6fb062

Cut execution time in half with one #[inline] :sigh:

d1b9990

CAD97 mentioned this pull request Aug 12, 2017

[char/range] Add CharRange and CharIter #112

Merged

CAD97 closed this Aug 12, 2017

behnam reviewed Aug 12, 2017


CAD97 deleted the unic-char-range branch August 13, 2017 08:14
